8 research outputs found

    Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

    We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech followed by supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly. Pre-trained models and code will be released online. Comment: accepted to ICASSP 2020 (oral).
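    The reconstruction objective described above can be sketched in a few lines. This is a toy illustration, not the paper's model: the deep bidirectional encoder is replaced by a trivial stand-in predictor (`context_average_predict`, a hypothetical name) that averages the frames immediately before and after the slice, and the loss is a plain L1 reconstruction error over the predicted slice.

```python
# Toy sketch of a DeCoAR-style objective: predict a temporal slice of
# filterbank frames [start, start + k) from surrounding context, then
# score the prediction with an L1 reconstruction loss.

def l1_slice_loss(frames, predicted, start, k):
    """Mean absolute error between the true slice frames[start:start+k]
    and the predicted frames (both lists of equal-length feature vectors)."""
    total, count = 0.0, 0
    for true_f, pred_f in zip(frames[start:start + k], predicted):
        for a, b in zip(true_f, pred_f):
            total += abs(a - b)
            count += 1
    return total / count

def context_average_predict(frames, start, k):
    """Trivial stand-in for a learned encoder: predict every frame of the
    slice as the average of the frames just before and just after it."""
    past, future = frames[start - 1], frames[start + k]
    return [[(p + f) / 2 for p, f in zip(past, future)] for _ in range(k)]

# 8 frames of a fake 3-bin filterbank sequence: frame t is [t, t+1, t+2].
frames = [[float(t + d) for d in range(3)] for t in range(8)]
pred = context_average_predict(frames, start=3, k=2)
loss = l1_slice_loss(frames, pred, start=3, k=2)
```

In the paper the predictor is trained to drive this reconstruction loss down on unlabeled audio; here the point is only the shape of the objective.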

    On lexical level matching

    In many natural language understanding applications, text processing requires comparing lexical units: words, phrases, named entities, and sentences. A significant amount of research has gone into studying and evaluating similarity metrics between those units. In this thesis, we summarize research on computing lexical similarity. We describe a new approach to computing similarity between two spans of text, using multiple semantic-unit-level comparison measures to compute sentence-level similarity scores.
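    As a concrete, hypothetical illustration of combining word-level measures into a sentence-level score (the thesis's specific measures are not reproduced here), the sketch below uses character-bigram Dice similarity between words and averages each word's best match in the other sentence, symmetrized over both directions.

```python
def word_sim(w1, w2):
    """Character-bigram Dice similarity between two words."""
    b1 = {w1[i:i + 2] for i in range(len(w1) - 1)}
    b2 = {w2[i:i + 2] for i in range(len(w2) - 1)}
    if not b1 or not b2:  # single-character words have no bigrams
        return 1.0 if w1 == w2 else 0.0
    return 2 * len(b1 & b2) / (len(b1) + len(b2))

def sentence_sim(s1, s2):
    """Average, over the words of one sentence, of the best-matching word
    similarity in the other sentence, symmetrized over both directions.
    Assumes both sentences are non-empty."""
    t1, t2 = s1.lower().split(), s2.lower().split()

    def directed(a, b):
        return sum(max(word_sim(w, v) for v in b) for w in a) / len(a)

    return (directed(t1, t2) + directed(t2, t1)) / 2
```

Identical sentences score 1.0 and fully disjoint ones 0.0; swapping in other word-level measures only requires replacing `word_sim`.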

    Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition

    Pretrained contextual word representations in NLP have greatly improved performance on various downstream tasks. For speech, we propose contextual frame representations that capture phonetic information at the acoustic frame level and can be used for utterance-level language, speaker, and speech recognition. These representations come from the frame-wise intermediate representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken utterances. We first train the model on the Fisher English corpus with context-independent phoneme labels, then use its representations at inference time as features for task-specific models on the NIST LRE07 closed-set language recognition task and a Fisher speaker recognition task, giving significant improvements over the state-of-the-art on both (e.g., language EER of 4.68% on 3sec utterances, 23% relative reduction in speaker EER). Results remain competitive when using a novel dilated convolutional model for language recognition, or when ASR pretraining is done with character labels only. Comment: submitted to INTERSPEECH 201
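    To use frame-wise representations for utterance-level tasks, they must be pooled over time into one fixed-size feature vector before a task-specific classifier is applied. A common choice (an assumption here, not necessarily this paper's exact pooling) is mean-plus-standard-deviation pooling:

```python
import math

def pool_utterance(frames):
    """Mean + standard-deviation pooling over time: concatenate the
    per-dimension mean and (population) stddev of the frame vectors
    into one utterance-level embedding of size 2 * dim."""
    dim, n = len(frames[0]), len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    return means + stds

emb = pool_utterance([[1.0, 2.0], [3.0, 4.0]])  # → [2.0, 3.0, 1.0, 1.0]
```

The resulting vector has a fixed size regardless of utterance length, which is what lets a single classifier handle both 3-second and 30-second inputs.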

    Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation

    Attention-based encoder-decoder (AED) speech recognition models have been widely successful in recent years. However, jointly optimizing the acoustic model and language model in an end-to-end manner creates challenges for text adaptation. In particular, adapting these models to new text effectively, quickly, and inexpensively has become a primary concern for deploying AED systems in industry. To address this issue, we propose a novel model, the hybrid attention-based encoder-decoder (HAED) speech recognition model, that preserves the modularity of conventional hybrid automatic speech recognition systems. Our HAED model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques. We demonstrate that the proposed HAED model yields a 21% relative Word Error Rate (WER) improvement when out-of-domain text data is used for language model adaptation, with only a minor degradation in WER on a general test set compared with a conventional AED model.
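    The modularity argument can be illustrated with a minimal decode-time sketch (an assumption for illustration, not the paper's exact formulation): a frozen acoustic score is combined log-linearly with a swappable language-model score, so the LM term can be re-estimated from text alone without retraining the acoustic component.

```python
def combined_score(am_logprob, lm_logprob, lm_weight=0.5):
    """Log-linear interpolation of acoustic-model and language-model
    log-probabilities, the classic hybrid-decoding combination."""
    return am_logprob + lm_weight * lm_logprob

def pick_word(candidates, lm_weight=0.5):
    """candidates: {word: (am_logprob, lm_logprob)} -> best-scoring word.
    Swapping in an adapted LM only changes the second score per word."""
    return max(candidates, key=lambda w: combined_score(*candidates[w], lm_weight))

# Acoustically ambiguous pair: the LM term breaks the tie.
cands = {"their": (-1.0, -3.0), "there": (-1.2, -0.5)}
```

With these (made-up) scores, `pick_word(cands)` prefers "there" because the in-domain LM score outweighs the small acoustic deficit.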

    Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

    Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions. Pretrained large language models (LLMs) have the potential to improve the performance of E2E ASR. However, integrating a pretrained language model into an E2E speech recognition model has shown limited benefits due to the mismatches between text-based LLMs and those used in E2E ASR. In this paper, we explore an alternative approach: adapting a pretrained LLM to speech. Our experiments on fully formatted E2E ASR transcription tasks across various domains demonstrate that our approach can effectively leverage the strengths of pretrained LLMs to produce more readable ASR transcriptions. Our model, which is built on pretrained large language models with either an encoder-decoder or a decoder-only structure, surpasses strong ASR models such as Whisper in recognition error rate, considering formats like punctuation and capitalization as well.

    Induction chemotherapy‐based organ‐preservation protocol improve the function preservation compared with immediate total laryngectomy for locally advanced hypopharyngeal cancer—Results of a matched‐pair analysis

    Abstract Background We performed a matched-pair analysis to compare the therapeutic effect of the induction chemotherapy-based organ-preservation approach with that of immediate total laryngectomy in hypopharyngeal squamous cell carcinoma patients requiring total laryngectomy. Methods A total of 351 patients treated with the organ-preservation approach were compared with 110 patients treated with total laryngectomy. The main outcome measures were progression-free survival (PFS), overall survival (OS), and larynx function preservation survival (LFPS). Results No statistically significant difference was observed in 3-, 5-, or 10-year PFS and OS between the two groups. In the organ-preservation group, the 3-, 5-, and 10-year LFPS was 30.7%, 23.3%, and 16.6%, respectively. LFPS followed the order Stage III > Stage IV, N0 > N1 > N2 > N3, T2 > T3 > T4, and CR > PR > SD > PD (all p values < 0.05). Conclusions Survival outcomes did not significantly differ between the two groups. The organ-preservation approach allowed more than 70% of the survivors to retain their larynx function.

    LGR5 marks targetable tumor-initiating cells in mouse liver cancer

    Cancer stem cells (CSCs) or tumor-initiating cells (TICs) are thought to be the main drivers of disease progression and treatment resistance across various cancer types. Identifying and targeting these rare cancer cells, however, remains challenging with respect to therapeutic benefit. Here, we report the enrichment of cells expressing LGR5, a well-recognized stem cell marker, in mouse liver tumors, and the upregulation of LGR5 expression in human hepatocellular carcinoma. LGR5-expressing cells isolated from mouse liver tumors are superior in initiating organoids and forming tumors upon engraftment, marking them as candidate TICs. These cells are resistant to conventional treatments, including sorafenib and 5-FU. Importantly, LGR5 lineage ablation significantly inhibits organoid initiation and tumor growth. Combining LGR5 ablation with 5-FU, but not with sorafenib, further augments therapeutic efficacy in vivo. Thus, we have identified the LGR5+ compartment as an important TIC population and a viable therapeutic target for combating liver cancer.